class: inverse, middle, center

Intro to Data


How is data stored, how do we use it?

Our example data:

Download the data csv file (link) and pay attention to where it downloads on your computer

  • Make sure it is a .csv file and not a “web archive” or something else.

Open the data file penguins.csv and look at it

  • What are the columns? What are the rows?

About the penguins data


Workflow - Keep it together!

Steps for a new data analysis project or homework:

  1. Create a folder to contain all your files.
  2. Move data file (penguins.csv) into this folder.
  3. Create an RStudio project inside this folder. (next slides)
  4. Create a new Rmd for your analyses/homework.

Do steps 1 & 2 now!


R Projects (.Rproj file) & Good Practices

Use projects to keep everything together (read this)

- A project keeps track of your coding environment and file structure.
- Create an RStudio project for each data analysis project, for each homework assignment, etc.
- A project is associated with a directory folder
  + Keep data files there
  + Keep code scripts there; edit them, run them in bits or as a whole
  + Save your outputs (plots and cleaned data) there
- Only use relative paths, never absolute paths
  + relative (good): read.csv("data/mydata.csv")
  + absolute (bad): read.csv("/home/yourname/Documents/stuff/mydata.csv")

Advantages of using projects

- standardizes file paths
- keeps everything together
- a whole folder can be easily shared and run on another computer
- when you open the project everything is as you left it
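The relative-path advice can be tried out with a small sketch. The folder `data` and the file `mydata.csv` come from the example above; the temporary `my_project` folder is a made-up stand-in for a real project folder, and `setwd()` is used only to imitate what opening an RStudio project does for you.

```r
# Stand-in for a project folder (hypothetical; in practice this is the
# folder that holds your .Rproj file).
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Put a tiny example file in the project's data/ subfolder.
write.csv(data.frame(x = 1:3),
          file.path(project_dir, "data", "mydata.csv"),
          row.names = FALSE)

# Opening an RStudio project sets the working directory to the project
# folder; setwd() imitates that here for illustration only.
old_wd <- setwd(project_dir)

getwd()                                # where relative paths start from
mydata <- read.csv("data/mydata.csv")  # relative path: no computer-specific part
nrow(mydata)

setwd(old_wd)                          # restore the previous working directory
```

Because the path starts at the project folder, the same line keeps working when the whole folder is copied to another computer.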


Create a new R project

Let’s go through it together. (Read this for more)

.pull-left-60[
- Click in top left or File -> New Project
- Click Existing Directory
- Browse to your folder with the data
- Optional: check the “Open in new session” box
- Click “Create project”
] .pull-right-40[

]

Bonus lessons


The data file will be in your Files pane:

and your workspace folder location will be showing at the top (e.g., Home/Desktop/workshop_practice)


Data in R/RStudio

Open penguins.csv in RStudio and look at it

We will show you how to store and use this data in R as a data frame

Currently it is still just a file in your folder.


To run and save your code: Create a new Rmd!


Load the packages we need in the Rmd

Add this code to the setup chunk in the Rmd and run that chunk:

.pull-left[

library(tidyverse)
## ── Attaching packages ────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
] .pull-right[

]

Now we can use functions in these packages, such as read_csv() and %>% and mutate() and tabyl()

Remove everything in the Rmd below this code

  • Loading library code should always be at the top of your Rmd so you can use these packages in code “lower down”
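If `library()` fails with a "there is no package called ..." error, the package is probably not installed yet. A minimal sketch for checking this, using only base R functions (the `needed` and `installed` names are just for this example):

```r
# Packages used in these slides.
needed <- c("tidyverse", "janitor")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
installed

# Install anything that is missing (run once, in the console, not in the Rmd):
# install.packages(needed[!installed])
```

Installing is a one-time step per computer, while `library()` has to be run in every new R session, which is why only the `library()` calls belong in the setup chunk.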

Load the data set into R

penguins <- read_csv("penguins.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character(),
##   year = col_double()
## )
# Run in console:
View(penguins) 
# Can also view the data by clicking on its name in the Environment tab
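The "Parsed with column specification" message lists the type `read_csv()` guessed for each column. As a hedged sketch, the same types can be stated explicitly with the `col_types` argument. The tiny file below is a stand-in holding three rows and four columns copied from the penguins data, so the snippet runs without penguins.csv.

```r
library(readr)   # part of the tidyverse

# A tiny stand-in file with three rows copied from the penguins data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,species,island,body_mass_g",
             "1689,Adelie,Torgersen,3750",
             "4274,Adelie,Torgersen,3800",
             "4539,Adelie,Torgersen,3250"),
           tmp)

# State the column types instead of letting read_csv() guess them.
small <- read_csv(tmp,
                  col_types = cols(
                    id = col_double(),
                    species = col_character(),
                    island = col_character(),
                    body_mass_g = col_double()
                  ))
small
```

Spelling out the types also silences the message, which keeps the knitted document tidier.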


Your Rmd should look something like this:

Try knitting it!


Load a data set: bonus lessons


class: inverse, middle, center

Object types


Data frames (aka “tibbles” in tidyverse)

.pull-left-60[ Vectors vs. data frames: a data frame is a table whose columns are vectors of the same length

penguins
## # A tibble: 342 x 9
##       id species island bill_length_mm bill_depth_mm flipper_length_…
##    <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>
##  1  1689 Adelie  Torge…           39.1          18.7              181
##  2  4274 Adelie  Torge…           NA            17.4              186
##  3  4539 Adelie  Torge…           40.3          18                195
##  4  2435 Adelie  Torge…           36.7          19.3              193
##  5  2326 Adelie  Torge…           39.3          20.6              190
##  6  2637 Adelie  Torge…           38.9          17.8              181
##  7  4443 Adelie  Torge…           NA            19.6              195
##  8  2102 Adelie  Torge…           34.1          18.1              193
##  9  2975 Adelie  Torge…           42            20.2              190
## 10  3966 Adelie  Torge…           37.8          17.1              186
## # … with 332 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>,
## #   year <dbl>

] .pull-right-40[

]

Variable (column) types

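| type           | description                                       |
|----------------|---------------------------------------------------|
| double/numeric | numbers that may have decimal places              |
| character      | text, “strings”                                   |
| integer        | integer-valued numbers                            |
| factor         | categorical variables stored with levels (groups) |
| logical        | boolean (TRUE, FALSE)                             |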
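To make the types in the table concrete, here is a small sketch in base R that builds one vector of each type and then combines them into a data frame. The values are illustrative only (the masses match the first three penguins; the years and sexes are made up for the example).

```r
# One small vector of each type from the table above.
mass    <- c(3750, 3800, 3250)                    # double
species <- c("Adelie", "Adelie", "Adelie")        # character
year    <- c(2007L, 2007L, 2008L)                 # integer (note the L suffix)
sex     <- factor(c("male", "female", "female"))  # factor
heavy   <- mass > 3500                            # logical

class(mass)     # "numeric"
typeof(mass)    # "double"
class(species)  # "character"
class(year)     # "integer"
class(sex)      # "factor"
levels(sex)     # "female" "male"
class(heavy)    # "logical"

# A data frame is a set of equal-length vectors, one per column.
df <- data.frame(mass, species, year, sex, heavy)
str(df)
```

Note that `read_csv()` reads whole numbers such as `year` as double rather than integer, which is why the penguins output shows `<dbl>` for that column.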

Data structure

glimpse(penguins)   # structure of data
## Rows: 342
## Columns: 9
## $ id                <dbl> 1689, 4274, 4539, 2435, 2326, 2637, 4443, 2102, 297…
## $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "…
## $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen",…
## $ bill_length_mm    <dbl> 39.1, NA, 40.3, 36.7, 39.3, 38.9, NA, 34.1, 42.0, 3…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.…
## $ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 1…
## $ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 425…
## $ sex               <chr> "male", "female", "female", "female", "male", "fema…
## $ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

Data set summary

summary(penguins)
##        id         species             island          bill_length_mm 
##  Min.   :1001   Length:342         Length:342         Min.   :32.10  
##  1st Qu.:2031   Class :character   Class :character   1st Qu.:39.45  
##  Median :2984   Mode  :character   Mode  :character   Median :44.70  
##  Mean   :3031                                         Mean   :44.00  
##  3rd Qu.:4073                                         3rd Qu.:48.52  
##  Max.   :4969                                         Max.   :59.60  
##                                                       NA's   :6      
##  bill_depth_mm   flipper_length_mm  body_mass_g       sex           
##  Min.   :13.10   Min.   :172.0     Min.   :2700   Length:342        
##  1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550   Class :character  
##  Median :17.30   Median :197.0     Median :4050   Mode  :character  
##  Mean   :17.15   Mean   :200.9     Mean   :4202                     
##  3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750                     
##  Max.   :21.50   Max.   :231.0     Max.   :6300                     
##                                                                     
##       year     
##  Min.   :2007  
##  1st Qu.:2007  
##  Median :2008  
##  Mean   :2008  
##  3rd Qu.:2009  
##  Max.   :2009  
## 
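The `NA's` entry in the summary counts missing values. A short sketch of how missing values affect a calculation, using the first ten bill lengths as printed earlier (the `bill_length` vector name is just for this example):

```r
# First ten bill lengths as printed earlier (two are missing).
bill_length <- c(39.1, NA, 40.3, 36.7, 39.3, 38.9, NA, 34.1, 42.0, 37.8)

summary(bill_length)             # includes an "NA's" count
sum(is.na(bill_length))          # number of missing values: 2
mean(bill_length)                # NA, because of the missing values
mean(bill_length, na.rm = TRUE)  # mean of the non-missing values
```

Most summary functions (`mean()`, `sd()`, `min()`, `max()`) behave the same way and accept `na.rm = TRUE`.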

Show (print) whole data frame

Printing a tibble shows only the first ten rows (and only the columns that fit on the screen), so you can’t actually see it all.

penguins
## # A tibble: 342 x 9
##       id species island bill_length_mm bill_depth_mm flipper_length_…
##    <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>
##  1  1689 Adelie  Torge…           39.1          18.7              181
##  2  4274 Adelie  Torge…           NA            17.4              186
##  3  4539 Adelie  Torge…           40.3          18                195
##  4  2435 Adelie  Torge…           36.7          19.3              193
##  5  2326 Adelie  Torge…           39.3          20.6              190
##  6  2637 Adelie  Torge…           38.9          17.8              181
##  7  4443 Adelie  Torge…           NA            19.6              195
##  8  2102 Adelie  Torge…           34.1          18.1              193
##  9  2975 Adelie  Torge…           42            20.2              190
## 10  3966 Adelie  Torge…           37.8          17.1              186
## # … with 332 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>,
## #   year <dbl>
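If you do want more rows or columns printed, the tibble print method takes `n` and `width` arguments. A minimal sketch with a made-up stand-in tibble (`example_tbl` is hypothetical; with the real data you would pass `penguins` instead):

```r
library(tibble)   # part of the tidyverse

# A stand-in tibble with more than ten rows (made-up values).
example_tbl <- tibble(id = 1:25, value = id * 2)

print(example_tbl, n = 15)              # first 15 rows
print(example_tbl, n = Inf)             # every row
print(example_tbl, n = 5, width = Inf)  # 5 rows, all columns
```

For a data set as large as penguins, `View()` is usually more comfortable than printing every row.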

View whole data frame

We showed this already; it is very handy for seeing all of the data. Run it in the console since it’s more interactive.

View(penguins)

Data set info

.pull-left-40[

dim(penguins)
## [1] 342   9
nrow(penguins)
## [1] 342
ncol(penguins)
## [1] 9

]

.pull-right-60[

names(penguins)
## [1] "id"                "species"           "island"           
## [4] "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm"
## [7] "body_mass_g"       "sex"               "year"

]


View the beginning of a data set

head(penguins)
## # A tibble: 6 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  1689 Adelie  Torge…           39.1          18.7              181        3750
## 2  4274 Adelie  Torge…           NA            17.4              186        3800
## 3  4539 Adelie  Torge…           40.3          18                195        3250
## 4  2435 Adelie  Torge…           36.7          19.3              193        3450
## 5  2326 Adelie  Torge…           39.3          20.6              190        3650
## 6  2637 Adelie  Torge…           38.9          17.8              181        3625
## # … with 2 more variables: sex <chr>, year <dbl>

View the end of a data set

tail(penguins)
## # A tibble: 6 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  1947 Chinst… Dream            45.7          17                195        3650
## 2  4452 Chinst… Dream            55.8          19.8              207        4000
## 3  2420 Chinst… Dream            43.5          18.1              202        3400
## 4  4861 Chinst… Dream            49.6          18.2              193        3775
## 5  4865 Chinst… Dream            50.8          19                210        4100
## 6  4162 Chinst… Dream            50.2          18.7              198        3775
## # … with 2 more variables: sex <chr>, year <dbl>

Specify how many rows to view at beginning or end of a data set

head(penguins, 3)
## # A tibble: 3 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  1689 Adelie  Torge…           39.1          18.7              181        3750
## 2  4274 Adelie  Torge…           NA            17.4              186        3800
## 3  4539 Adelie  Torge…           40.3          18                195        3250
## # … with 2 more variables: sex <chr>, year <dbl>
tail(penguins, 1)
## # A tibble: 1 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  4162 Chinst… Dream            50.2          18.7              198        3775
## # … with 2 more variables: sex <chr>, year <dbl>
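The row count given to `head()` and `tail()` can also be negative, which means "everything except". A small base R sketch with a made-up ten-row data frame:

```r
# Made-up data frame with ten rows.
df <- data.frame(id = 1:10)

head(df, -2)        # all rows except the last 2
tail(df, -7)        # all rows except the first 7

nrow(head(df, -2))  # 8
nrow(tail(df, -7))  # 3
```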

Data frame cells, rows, or columns (rarely used)

.pull-left-60[

Specific cell: DataSetName[row#, column#]

# Second row, Third column
penguins[2, 3]
## # A tibble: 1 x 1
##   island   
##   <chr>    
## 1 Torgersen

Entire row: DataSetName[row#, ]

# Second row
penguins[2,]
## # A tibble: 1 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  4274 Adelie  Torge…             NA          17.4              186        3800
## # … with 2 more variables: sex <chr>, year <dbl>

]

.pull-right-40[ Entire column: DataSetName[, column#]

# Third column
penguins[, 3]
## # A tibble: 342 x 1
##    island   
##    <chr>    
##  1 Torgersen
##  2 Torgersen
##  3 Torgersen
##  4 Torgersen
##  5 Torgersen
##  6 Torgersen
##  7 Torgersen
##  8 Torgersen
##  9 Torgersen
## 10 Torgersen
## # … with 332 more rows

]
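Column numbers are easy to get wrong, so columns are more often picked by name. A hedged sketch using a three-row stand-in tibble (`small` is a made-up name; its values are copied from the first rows of the penguins data):

```r
library(tibble)   # part of the tidyverse

# Three rows copied from the printed penguins data.
small <- tibble(
  id = c(1689, 4274, 4539),
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  body_mass_g = c(3750, 3800, 3250)
)

small[2, "island"]      # cell by row number and column name (a 1 x 1 tibble)
small[, "body_mass_g"]  # one column, still a tibble
small$body_mass_g       # one column as a plain vector
small[["body_mass_g"]]  # same result as $
small$body_mass_g[2]    # second value of that column: 3800
```

Single brackets keep the result as a tibble, while `$` and `[[ ]]` pull the column out as a vector, which is what functions like `mean()` expect.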